ABSTRACT: A theory of early and intermediate visual information processing is
given, which extends to about the level of figure-ground separation. Its core is a
computational theory of texture vision. Evidence obtained from perceptual and
from computational experiments is adduced in its support. A consequence of the
theory is that high-level knowledge about the world influences visual processing
later and in a different way from that currently practiced in machine vision.

Summary

Understanding how the visual cortex analyzes natural images is
one goal of visual neurophysiology. At some stage, we need to confront the
information processing problems that are involved. A series of computational
experiments on natural images was therefore undertaken, and a visual
pre-processor emerged with the following structure:

(1) Approximations to the first and second directional derivatives of intensity are
measured everywhere. They are computed by convolving the image with
"edge-shaped" and "bar-shaped" masks.

(2) These measurements are parsed into an orientation-dependent description of
the intensity changes present in the image. The parsing process consists of
discovering and matching peaks and troughs in the measurements, and roughly
classifying local patterns of peaks into EDGEs, LINEs, SHADING, etc.

(3) The descriptions obtained at each orientation are combined, termination points
of edges are discovered, and small blobs are isolated and described.

This pre-processor computes what is called the primal sketch of
an image, but for most images it is large and unwieldy. By examining our ability to
interpret certain simple drawings, it is demonstrated that a variety of abstract
grouping processes and related facilities are present in our visual systems. It is
shown how, if applied to the primal sketch, these processes are capable of
successfully analyzing many kinds of visual texture and of extracting perceived
"figure" from ground. It is conjectured that these operations can account for the
entire range of texture discriminations of which we are capable, and the analysis
of several real images is given in its support. The conjecture relegates the
influence of higher-level knowledge on visual processing to a much later stage than
is currently found in machine vision programs, and it implies that such knowledge
should influence the control of, rather than the actual computations in, the earlier
stages of analysis.



Preface 


The work of Mountcastle (1957), of Lettvin et al. (1959), and of
Hubel and Wiesel (1962) initiated what is widely regarded as a
breakthrough in visual neurophysiology. But despite the subsequent accumulation
of a wealth of anatomical and physiological information about the mammalian visual
cortex, our knowledge of its information processing function, or even of how
difficult are the problems that it solves, remains rudimentary.

This is no accident. Physiology has always been concerned with
how organisms work. Its goals are to unravel the local mechanisms within an
organism and to understand their place in the functioning of the animal as a whole.
While the concerns of physiology lay with mechanical, or even with chemical or
physical phenomena, the physiologist's background knowledge and everyday
experience sufficed to provide him with the necessary insight into function. As
physiology has turned to information processing problems, however,
neurophysiologists have lost the reliable background intuition that has been
fundamental to the success of the discipline in the past. The situation in modern
neurophysiology is that people are trying to understand how a particular
mechanism performs a computation that they cannot even formulate, let alone
provide a crisp summary of ways of doing. To rectify the situation, we need to
invest considerable effort in studying the computational background to questions
that can be approached in neurophysiological experiments.

Therefore, although the work described here arises from a deep
commitment to the goals of neurophysiology, the work is not about neurophysiology
directly, nor is it about simulating neurophysiological mechanisms: it is about
studying vision. It amounts to a series of computational experiments, inspired in
part by some findings in visual neurophysiology. The need for them arises
because, until one tries to process an image or to make an artificial arm thread a
needle, one has little idea of the problems that really arise in trying to do these
things. Computational experiments allow one to study in detail what combination of
factors causes a method, or group of methods, to succeed or fail in a number of
particular circumstances that originate from real-world data. The power of this
approach is that the knowledge one obtains concerns facts that are inherent in the
task, not in the structural details of the mechanism performing it. Such knowledge
is a vital prerequisite for understanding mammalian visual systems fully, and it is
knowledge that cannot be obtained in any other way.


Introduction 


The vision problem begins with a large gray-level intensity array,
and culminates in a description that depends on that array, and on the purpose for
which it is being viewed. The question of interest is what has to go on in
between. In this article, we shall restrict our attention to single frame,
monochromatic, monocular images without specularities, reflections, translucency,
transparency, or light sources; and we shall study some of the problems that arise
in understanding early and intermediate levels of visual information processing.

Perhaps the best way of introducing the topic is to pose some
questions:

(1) What is early visual processing for?

(2) How much of visual information processing can proceed using purely
data-driven techniques?

(3) At what level and by what mechanisms may texture vision and figure-ground
phenomena be implemented?

(4) When does higher level knowledge about the world have to begin interacting
with purely data-driven processes?

(5) When and how does purpose have to influence what computations are made
on an image?


Recent work in computer vision has tried to involve high-level
knowledge about the world at a very early stage in the processing (Shirai 1974,
Freuder 1975). The main motivations for this have been that it has proved very
difficult to extract object boundaries from intensity arrays, and that strategic
deployment of high-level knowledge about a scene can sometimes greatly reduce
the computational effort required for primary image processing. This article
opposes this trend, and makes three main arguments. The first argument consists
of a demonstration that a very great deal of information may in fact be extracted
from an image using knowledge-free techniques. The price one pays for this is
prodigious computing power, and it involves programs that are considerably more
complex than feature-point detecting routines. There can, however, be little doubt
that our own visual systems do in fact possess enormous power (Thomas and
Binford 1974, p 16). The second argument is that deciding what a low-level visual
processor can and cannot deliver is a pre-requisite for useful research into
higher-level problems of recognition. For example, the problem of recognizing
and interpreting a scene has a very different flavor in vision systems with rich and
with poor pre-processing abilities. The difference is almost as extreme as trying
to make sense out of an English sentence with and without the benefit of a
knowledge of English syntax. Hence, unless one has a firm idea about what
pre-processing is possible, one is in danger of expending effort on problems that, in a
real sense, are not problems at all. The third argument is that our own perceptual
apparatus probably contains a rich pre-processing ability. Hence if machine vision
intends to say anything useful about those computations, it had better examine the
lower problems first, and study the later ones when the peripheral processing has
been solved. Otherwise one is conducting research without the benefit of data on
which to test one's conclusions. This amounts to a reckless abandoning of
precisely the new experimental tools that computer technology has made available,
namely the ability to decide whether a computational theory successfully
addresses the problems that arise in real-world data.

This article presents a theory of visual processing for its chosen
class of images up to about the level of the figure-ground problem. Its main focus
is a new computational theory of texture vision. The article gives a sufficient
number of examples of processed images to establish that the theory is not
obviously inadequate. The detailed and lengthy arguments that make a positive
case for adequacy will appear elsewhere (Marr 1976). The argument is quite
protracted, and relies on several main steps. Its overall thrust is that the first
step of consequence in visual information processing is to compute a primal
description of the image, and that all subsequent computations are implemented as
manipulations of that description. In order that the reader may follow with ease
the stages in the argument, I summarize the main steps here:

(1) The function of early visual processing is to compute a description of the
gray-level changes present in an image in terms of a vocabulary of gray-level
change primitives. These primitives consist of straight contour segments of
various kinds (SHADING-EDGE, EXTENDED-EDGE, etc.), LINEs, BLOBs, and of various
parameters bound to them such as FUZZINESS, CONTRAST or LIGHTNESS, POSITION,
ORIENTATION, simple measures of their SIZE, and a specification of their
TERMINATION points. This primitive description is obtained from the intensity
array by knowledge-free techniques, and it is called the PRIMAL SKETCH. It
differs from an array of feature points in a subtle way, which is explained in the
text.

(2) From our ability to interpret drawings, one may infer the presence in our
perceptual equipment of symbolic processes that are capable of grouping lines,
points, and blobs together in various ways. Non-symbolic techniques, like
examining the power spectrum of the spatial Fourier transform of the drawings,
cannot account for these grouping phenomena, since the groupings are performed
by mechanisms of construction rather than mechanisms of detection.

(3) For most images, the primal sketch is large and unwieldy. It can however be
capably analyzed by a mechanism that has available the symbolic processes
discovered in step (2), together with the ability to select items out of the primal
sketch on the basis of first-order discriminations acting on the principal
parameters. Hence, it is argued, texture vision rests on grouping operations and
first-order discriminations operating on the primal sketch, rather than on second
order operations operating on the intensity array as suggested by Julesz (1975).
It is further argued that the set of processes whose existence is necessary in
order to explain our ability to interpret drawings is also sufficient, when applied
to the primal sketch, to explain the range of texture vision that is present in
humans. Fourier and power-spectrum techniques on their own are certainly
deficient, and probably also unnecessary.

(4) The extraction of a form from the primal sketch using these techniques
amounts to the figure-ground computation. Except in difficult cases, this extraction
can proceed successfully without calling upon higher level knowledge, and it
precedes the computation of the shape of the extracted form. This has two
important consequences. Firstly, the isolation and delivery of a form to
subsequent processes does not depend on being able to assign an accurate
high-level description to it; and secondly, because of this it is easy to compute rough
descriptions of complex forms. This is probably essential for the fluency of
subsequent analysis of shape.

(5) The extent to which higher level knowledge and purpose influence the
processing up to this stage is very limited. There is at present no reason to
believe that higher-level knowledge is needed to compute the primal sketch at all;
and its role in the extraction of form from the primal sketch can often be limited to
deciding which form should be extracted. It is conjectured that in all cases,
higher-level knowledge need be only weakly coupled to the processes that separate
figure and ground. This relegates the use of higher level knowledge to a much
later stage than is found in current machine vision programs, and simultaneously
confines much of its impact to influencing control, rather than interfering with the
actual data-processing that is taking place lower down.

Each step in the argument is treated in a separate section.

Early Processing: computing the primal sketch 

The primal sketch consists of a primitive but rich description of
the intensity changes that are present in an image. This description consists of a
set of assertions, expressed in terms of a vocabulary of symbols and modifiers
that are powerful enough to capture all of the important information in an intensity
array. An example of such an assertion might be:



(SHADING-EDGE (POSITION (34 46) (73 48))
              (CONTRAST 34)
              (FUZZINESS 17)
              (ORIENTATION 0))

The first problem is how such an assertion may be computed -- what
measurements should one first make on an image, and how should those
measurements be combined to enable the assertion to be made?

To help us answer these questions, let us see what
neurophysiology tells us. Simple cells in the cat make measurements upon an
image, and the nature of the measurement that they make is fairly well understood.
Their receptive fields are either bar- or edge-shaped (Hubel and Wiesel 1962),
and if other parameters are held constant, they signal the linear convolution of a
bar- or edge-shaped mask with the intensity distribution currently falling upon the
retina, in logarithmic units of contrast (Maffei and Fiorentini 1973, figure 8). Not
all of what are now called simple cells behave linearly, but a distinct subclass
does. The important question for understanding the analysis of visual information
is whether these cells represent assertions other than the fact of the
measurement itself; and if they do, what are they? One idea is, for example, that
a cell with a bar-shaped receptive field signals an assertion about the presence of
a bar in the visual field; but a moment's thought reveals that this is impossible,
since such cells respond also to the presence of a single edge. Another puzzle
concerns the existence of both bar-shaped and edge-shaped receptive fields (in
different cells). Since both kinds detect changes in intensity, why are both types
needed? The reason is probably that changes in intensity are not the only
important types of change in an image -- changes in intensity gradient often
provide important, and sometimes the only, information that an object boundary is
present (Marr 1974b). An edge that consists of a step change in intensity gradient
rather than in intensity may be produced by a Lambertian white cube aligned at 45
degrees to the viewer and illuminated from the viewing position. Perceptual
evidence of our sensitivity to such edges is easy to find: Mach bands are the
most well-known example (see e.g. Ratliff 1965). This immediately suggests that
one should regard simple cells that have an edge-shaped receptive field as
measuring something like the first directional derivative of intensity, and those with
a bar-shaped receptive field as measuring the second directional derivative. Two
questions then arise: firstly, why compute directional measures? And secondly,
what should one do with the measurements when one has them?
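
The reading of edge-shaped and bar-shaped receptive fields as first and second
directional derivatives can be illustrated by a minimal sketch on a
one-dimensional intensity profile. The mask shapes, the panel widths, and the
function names below are assumptions made for illustration only; they are not
the masks used in the experiments reported here.

```python
from typing import List, Sequence


def edge_mask(panel: int) -> List[float]:
    """Edge-shaped mask (assumed shape): a -1 panel beside a +1 panel.

    Applied to a profile it estimates a first difference at the scale `panel`.
    """
    return [-1.0] * panel + [1.0] * panel


def bar_mask(panel: int) -> List[float]:
    """Bar-shaped mask (assumed shape): +1 flanks around a -2 centre.

    Applied to a profile it estimates a second difference at the scale `panel`.
    """
    return [1.0] * panel + [-2.0] * panel + [1.0] * panel


def correlate(signal: Sequence[float], mask: Sequence[float]) -> List[float]:
    """Slide the mask along the signal, keeping positions where it fits fully.

    The mask is not reversed, so strictly this is a correlation; for these
    masks a true convolution differs at most in sign.
    """
    m = len(mask)
    return [
        sum(signal[i + j] * mask[j] for j in range(m))
        for i in range(len(signal) - m + 1)
    ]


if __name__ == "__main__":
    step = [0.0] * 8 + [10.0] * 8                      # step change in intensity
    gradient_step = [0.0] * 8 + [float(k) for k in range(1, 9)]  # change in gradient
    print(correlate(step, bar_mask(2)))           # two lobes of opposite sign
    print(correlate(gradient_step, bar_mask(2)))  # one lobe, no sign change
    print(correlate(gradient_step, edge_mask(2))) # rises to a plateau, no peak
```

On the step profile the bar-shaped mask yields two lobes of equal size and
opposite sign, while on the profile whose gradient changes abruptly it yields
a single lobe and the edge-shaped mask yields no isolated peak at all. This is
one way of seeing why both kinds of measurement might be wanted.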

The application of a bar-shaped mask to an image does not, as we
have seen, lead directly to an assertion about the presence of a bar in the image.
The underlying point concerns the relation between computing the bar assertion,
and the inverse transform of the original measurement, and it is a point of some
importance. Let us consider the computation of an assertion about the presence of
a corner in the image of figure 1a. A way of computing this assertion that
immediately springs to mind is to take a specially "tuned" corner-shaped mask. One
might conjecture that a "corner" exists in the image at a point P provided that the
mask gives a value there which is greater than some threshold. Figures 1b and c
show the convolution of corner masks with the image; but can the reader
confidently distinguish the corners from those measurements? The reason for the
failure is that the inverse transform to that produced by a corner-shaped
receptive field depends critically on the boundary conditions that obtain. Any
method that computes a corner assertion is saying something about this inverse,
and so must take enough information into account at each point to satisfy the
dependence on boundary conditions. This extra information may be provided by
looking at the results of the corner-mask at neighboring points, or by looking at
the results of some other measurement taken in parallel; the important point is
that the computation is not a trivial one, and has to take those extra factors into
account. It is not impossible to use primary measurements that are not orientation
sensitive, but the extra computation involved is expensive, since one switches
from having to look in just two directions to having to look in all directions. A
persuasive case would have to be made if one were to choose a primary
measurement that was not directionally selective.


FIGURE 1

1. The image of a chair (1a) has been convolved with two "corner-masks" (1b and
1c). The mask shapes are shown in the figures. Detecting corners from such
measurements is not straightforward.


Translating the measurements into a description 

Suppose then that one measures the first and second directional
derivatives of intensity everywhere in an image. What do we do with them?
Translating one large array of numbers into several other large arrays is not an
obviously useful process. It turns out, however, that we can make a great
simplification at this stage in the analysis. Provided that measurements are made
with masks of several sizes, one can show that the positions and sizes of the
peaks in the measurements provide enough information to compute the description
of the underlying intensity changes. Furthermore, provided that a group of peaks
is sufficiently isolated from other peaks, the other peaks may be ignored when
analyzing that group.

The reason for this is illustrated in figure 2, which shows the
difference between edge-mask values obtained using masks of two different sizes
on a step change in intensity (2a), and on a gradual change (2b). The results are
analogous to the power spectra of different kinds of edge. Step changes are
seen equally well by all sizes of mask. Gradual changes are seen increasingly
faintly by edge-shaped masks whose dimensions are smaller than the distance over
which the intensity change is taking place. Figure 2c shows this effect in graphic
form, and from it one can see that a good estimate of the "fuzziness" of an edge
may be made by finding the mask size at which the edge-mask response starts to
fall.


FIGURE 2


2. Diagrams of "edge-shaped" mask convolutions with a step (a) and with a gradual
(b) intensity change. The intensity profiles appear at the top. The convolutions
with the two sizes of mask shown on the left appear beneath the intensity
profiles. For a step change in intensity, masks of all sizes produce the same
maximum response (trace a in graph (c)). Gradual intensity changes are seen
progressively more weakly by the smaller masks (trace b in graph (c)).

This shows one way in which the use of multiple mask sizes is
important, but there is another reason which is perhaps even more important. It is
that where a faint edge exists in the image, it is frequently impossible to tell from
a single record which of the peaks are important, and which are due to noise.
Matching peaks obtained using different sizes of mask greatly aids the separation
of signal from noise.
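
The effect shown in figure 2 can be reproduced with a small numerical sketch.
It assumes an edge-shaped mask normalised by its panel width (so that the
response is a difference of panel means), a set of mask sizes, and a
tolerance of 0.9 for deciding when the response "starts to fall"; all three
are illustrative assumptions rather than the procedure actually used, for
which see Marr (1974b).

```python
from typing import List, Optional, Sequence


def profile(length: int, start: int, width: int, height: float) -> List[float]:
    """Intensity profile rising from 0 to `height`; width 0 gives a step."""
    out = []
    for i in range(length):
        if i < start:
            out.append(0.0)
        elif i >= start + width:
            out.append(height)
        else:
            out.append(height * (i - start + 1) / (width + 1))
    return out


def peak_edge_response(signal: Sequence[float], panel: int) -> float:
    """Largest |mean(right panel) - mean(left panel)| over all mask positions."""
    best = 0.0
    for i in range(len(signal) - 2 * panel + 1):
        left = sum(signal[i:i + panel]) / panel
        right = sum(signal[i + panel:i + 2 * panel]) / panel
        best = max(best, abs(right - left))
    return best


def estimate_fuzziness(
    signal: Sequence[float], panels: Sequence[int], tolerance: float = 0.9
) -> Optional[int]:
    """Largest mask size whose peak response falls below the reference.

    The reference is the response of the largest mask. Returns None when no
    mask shows a weakened response, i.e. the edge looks sharp at every size.
    """
    sizes = sorted(panels, reverse=True)
    reference = peak_edge_response(signal, sizes[0])
    for panel in sizes[1:]:
        if peak_edge_response(signal, panel) < tolerance * reference:
            return panel
    return None


if __name__ == "__main__":
    sizes = [1, 2, 4, 8, 16]
    step = profile(64, 28, 0, 16.0)
    gradual = profile(64, 28, 7, 16.0)
    for name, signal in (("step", step), ("gradual", gradual)):
        responses = [peak_edge_response(signal, p) for p in sizes]
        print(name, responses, estimate_fuzziness(signal, sizes))
```

With the normalisation assumed here, the step gives the same peak response at
every mask size, whereas the gradual change gives a response that shrinks as
the mask becomes smaller than the transition.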

The process of computing the description may therefore be
reduced to three operations: firstly, find the peaks in the measurements obtained
from the convolutions of the image with different sizes of mask, and select the
relevant peaks using the criterion illustrated in figure 2; secondly, separate the
peaks into isolated groups; and thirdly, parse the local configuration of peaks into a
descriptive element. A small number of classes of peak configuration suffices to
cover the cases that can actually occur, and they are illustrated in figure 3. The
figure shows typical combinations of peak patterns that occur in the outputs from
edge-mask (upper records) and from bar-mask (lower records) convolutions.
Examples of the masks that we use appear in figure 3a. The descriptor EDGE is
used when two peaks of about equal and opposite signs occur together in the
bar-mask record (3b). If one bar-mask peak is considerably smaller than the other, the
edge is classified as an EXTENDED-EDGE (3c). Extended-edges are common where
a convex boundary is illuminated from one side. Figure 3d shows an intensity
gradient edge, and figure 3e corresponds to the presence of a thin LINE such as
can occur in the glare off an object's edge, or a very thin pencil stroke. Finally
there are edges that begin and end gradually, and extend over a relatively large
distance; these are classified as SHADING-EDGEs (figure 3f). In addition to
descriptors of edge type, one can measure an edge's STRENGTH, POSITION,
ORIENTATION, and FUZZINESS. This last parameter is computed by comparing the
amplitudes of the peaks obtained using masks of the same shape but different
sizes. (See figure 2, and Marr (1974b) for the details.)
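
The first and third of these operations can be sketched for the simplest case,
the distinction between EDGE and EXTENDED-EDGE in a bar-mask record. The
peak-finding rule, the `ratio` threshold of 0.5 that stands in for "about
equal", and all names below are assumptions introduced for illustration; the
other stereotypes of figure 3 are not modelled.

```python
from typing import List, NamedTuple, Optional, Sequence


class Peak(NamedTuple):
    position: int
    amplitude: float


def find_peaks(record: Sequence[float], min_amplitude: float = 1e-9) -> List[Peak]:
    """Positive local maxima and negative local minima of a convolution record."""
    peaks = []
    for i in range(1, len(record) - 1):
        prev, cur, nxt = record[i - 1], record[i], record[i + 1]
        is_max = cur > 0 and cur > prev and cur >= nxt
        is_min = cur < 0 and cur < prev and cur <= nxt
        if (is_max or is_min) and abs(cur) >= min_amplitude:
            peaks.append(Peak(i, cur))
    return peaks


def classify_bar_pair(a: Peak, b: Peak, ratio: float = 0.5) -> Optional[str]:
    """Classify two neighbouring bar-mask peaks of opposite sign.

    Returns "EDGE" when the peaks are of about equal size, "EXTENDED-EDGE"
    when one is considerably smaller, and None when the signs do not differ.
    The value of `ratio` is an assumed threshold.
    """
    if a.amplitude * b.amplitude >= 0:
        return None
    small, large = sorted((abs(a.amplitude), abs(b.amplitude)))
    return "EDGE" if small / large >= ratio else "EXTENDED-EDGE"


if __name__ == "__main__":
    balanced = [0.0, 10.0, 20.0, 0.0, -20.0, -10.0, 0.0]
    lopsided = [0.0, 5.0, 20.0, 0.0, -4.0, -2.0, 0.0]
    for record in (balanced, lopsided):
        first, second = find_peaks(record)
        print(first, second, classify_bar_pair(first, second))
```

A fuller version would first separate the peaks into isolated groups and would
consult the edge-mask record as well as the bar-mask record before committing
to a descriptor.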

Figure 4 gives an example of an intensity distribution that has
been described by this process, and the legend explains which mask convolutions
were used. One of the assertions has been traced back to the convolution
profiles, and the arrows point to the peaks that gave rise to that particular
assertion. The low-level vocabulary that is used here is not intended to be
definitive, but some claim is made to the effect that it is a good example of the
genre, because it has sufficient expressive power to describe most kinds of
shading adequately, and the method is simple and works reasonably well.
Experiments are being planned to determine whether the types of intensity change
that are distinguished by these primitives are also perceptually distinct.


FIGURE 3


3. Examples of edge- and bar-masks appear in 3a. 3b - f give the classification
that is described in the text of peak patterns in edge- and bar-mask convolution
profiles. The primary visual processor uses these stereotypes to classify intensity
changes in an image.


FIGURE 4


4. The intensity distribution exhibited in 4a, whose profile appears in 4b, was
obtained by illuminating a curved piece of white paper from one end, and viewing
it from above. Its description, computed using an edge-mask of panel-width 8 (4c),
and bar-masks of panel-widths 4 (4d) and 8 (4e), is as follows:

EDGE (POSITION 180) (AMOUNT 136) (FUZZ SHARP)
EDGE (POSITION 312) (AMOUNT 3) (FUZZ 4)
EDGE (POSITION 332) (AMOUNT 2) (FUZZ SHARP)
EDGE (POSITION 535) (AMOUNT -3) (FUZZ 4)
EDGE (POSITION 544) (AMOUNT 28) (FUZZ 5)
EDGE (POSITION 564) (AMOUNT 2) (FUZZ 4)
EDGE (POSITION 590) (AMOUNT 1) (FUZZ 4)
EXTENDED-EDGE (POSITION 682) (AMOUNT -12) (FUZZ 9)
    (the peaks giving rise to this edge are marked with arrows)
EDGE (POSITION 724) (AMOUNT -20) (FUZZ 6)
EDGE (POSITION 776) (AMOUNT 3) (FUZZ 4)
EDGE (POSITION 784) (AMOUNT -4) (FUZZ 4)
SHADING-EDGE (POSITION 670) (AMOUNT -14) (WIDTH 67)
SHADING-EDGE (POSITION 491) (AMOUNT 4) (WIDTH 36)
SHADING-EDGE (POSITION 439) (AMOUNT -8) (WIDTH 73)

FIGURE 5


5. After description of intensity changes has occurred independently at each of 8
orientations, and after linear assembly of these descriptions has taken place
locally, the eight descriptions are combined. An example of the result obtained
from 5a appears in 5b. Short noise elimination then takes place, giving 5c. The
asterisks denote places at which directional measures of contrast suddenly change.
They are the precursors of termination assertions.


Combining orientation-dependent descriptions

We have seen how to compute an orientation-dependent
description of the intensity changes, and we now deal with the problems of
combining local pieces of description from the same orientation, and of combining
the descriptions obtained at different orientations. What then are the issues that
are raised in combining the local analyses described in the previous section?

The information that is used during this operation is primarily of
two kinds: local consistency relations, which enable one to string local assertions
together; and local competition between competing descriptions of the same
phenomenon obtained from masks at different orientations. Surprisingly, it turns
out that the local consistency relations are more important than local competition,
and that local competition is required not so much between descriptions obtained
from masks at nearly adjacent orientations, but between the descriptions obtained
from masks that are nearly perpendicular.

Figure 5 illustrates the problems that arise. The image was first
operated on at eight orientations with the process described in the last section.
Next, these local assertions have been glued along directions nearly parallel to the
masks from which they were obtained. An interesting feature of the process is the
abundance of short segments perpendicular to the primary edge (figure 5b). These
arise because of a combination of local noise, the image tessellation, and other
irregularities in the image. They occur in every image we have processed. In
dealing with them, one cannot dismiss in a cavalier manner all very short segments;
tiny "blobs" in the image also give rise to them, as can be seen from the same
image at coordinate (73, 75). But a "small" element like this can be ignored if (a)
it crosses a "long" element, and (b) its contrast is less than that of the item it
crosses. Figure 5c shows the results of removing small noise elements using this
criterion.
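
The two-part criterion for discarding short noise elements can be written down
directly. In the sketch below the segment representation, the length limits
that stand in for "small" and "long", and the decision that merely touching
does not count as crossing are all assumptions made for illustration.

```python
from dataclasses import dataclass
from math import hypot
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class Segment:
    start: Point
    end: Point
    contrast: float

    @property
    def length(self) -> float:
        return hypot(self.end[0] - self.start[0], self.end[1] - self.start[1])


def _side(p: Point, q: Point, r: Point) -> float:
    """Signed area telling on which side of the line p->q the point r lies."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])


def crosses(a: Segment, b: Segment) -> bool:
    """True when the segments properly intersect; touching ends do not count."""
    d1 = _side(b.start, b.end, a.start)
    d2 = _side(b.start, b.end, a.end)
    d3 = _side(a.start, a.end, b.start)
    d4 = _side(a.start, a.end, b.end)
    return d1 * d2 < 0 and d3 * d4 < 0


def remove_short_noise(
    segments: Sequence[Segment], short_max: float = 3.0, long_min: float = 10.0
) -> List[Segment]:
    """Drop a short segment only if (a) it crosses a long one and
    (b) its contrast is weaker than that of the segment it crosses."""
    kept = []
    for s in segments:
        is_noise = s.length <= short_max and any(
            t.length >= long_min
            and abs(s.contrast) < abs(t.contrast)
            and crosses(s, t)
            for t in segments
            if t is not s
        )
        if not is_noise:
            kept.append(s)
    return kept
```

Note that an isolated short segment survives, which is what allows tiny blobs
to be retained rather than dismissed along with the noise.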

The asterisks in the figure signify that the contrast of the edge
changes rapidly at that point, possibly becoming zero. They are the precursors of
assertions about the presence of terminations, but space forbids a discussion of
them here (see Marr 1974c).

The only other item of note in computing the primal sketch is the
question of detecting local, small blobs. Figure 5c at coordinate (73, 75) shows
how they appear, and in fact we make small blobs a primitive element of the
primal sketch, together with their associated "intensity" value, and the sizes and
orientations of their major and minor axes. Finding these blobs from the glued
assertions depends a small amount on elegant programming, and a large amount on
brute force. The reader may ask why do we detect blobs in this way; why not
use a simple blob-detector like a mask with a centre-surround organization? The
reasons are twofold. Firstly, when using a centre-surround mask to generate
assertions, one has to be very careful of the boundary condition problem
mentioned earlier. One can devise parallel schemes of the form "a blob exists at
point P if the centre-surround mask gives an isolated peak there, and if there are
no edges in the vicinity," but these are relatively expensive to compute, and
become unreliable if the blob is not very circular, or if there are indeed other,
fainter or unrelated edges in the vicinity. It is interesting in this connection to
note that the phosphenes produced by stimulating a point in area 17 -- an act
which presumably stimulates orientation-sensitive cells at all orientations --
commonly take the form of a bright point in the visual field (Brindley 1970, p 124).

The primal sketch differs from a simple feature-point array in a
rather subtle way, and as a model of the information-processing that is performed
in area 17 it makes some definite and perhaps unexpected statements. Some
examples will help to make this clear. One consequence is that the direct output
of a linear simple cell is not available as an element in the primal sketch. Its
measurement is used to create an assertion about the presence of an edge, and
that assertion is what is available. Creating the assertion is an act of computation
-- a simple one, since it involves little more than peak matching and the
classification of a peak configuration, but an act of computation nonetheless. The
main point is that this has to go on.

An interesting consequence of this is illustrated in figure 6.
Suppose that an image contains two small close blobs. These blobs give rise to
measurements by a number of sizes of mask -- some small ones represented by
the tiny line segments, and some large ones, like the one that is illustrated. One's
a priori inclination would be to believe that the large "line-detector" would fire, and
that this would have something to do with seeing the two blobs. This view
amounts to supposing that simple cells write directly into a feature-point array.
But if our theory is correct, although the large "simple cell" may indeed fire, its
measurement will not be used to compute the description of the two blobs
because their sharp boundaries cause the associated intensity change to be
described from peaks in the small masks. The effect illustrated in figure 2c will
cause the description to be computed from the smaller masks unless the blobs are
severely defocussed. [Compare also our failure to perceive L. D. Harmon's coarsely
sampled and quantized image of Abraham Lincoln (Julesz 1971, p. 311).] I mention
this point because Julesz (1975, pp 40-42) has concluded that in situations like
this one, the output of large simple cells in this configuration plays no part in
texture vision discriminations. We shall see the relevance of this shortly.

The structure of the primal sketch may be summarized as follows:

PS1. The primary visual processor delivers a symbolic description of the intensity
changes present in an image. This description uses the following primitives to
describe intensity changes:

(i) Various types of EDGE.
(ii) LINEs, or thin BARs.
(iii) BLOBs.

The items (i) and (ii) have been assembled into straight segments, and short noise
elimination has occurred.

PS2. The following items are bound to each element of the description:

(i) ORIENTATION.
(ii) SIZE - length and width if both are defined, diameter if
major and minor axes are equal or undefined.
(iii) INTENSITY (LIGHTNESS).
(iv) POSITION.
(v) TERMINATION POINTS.
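
One way of holding the elements summarized in PS1 and PS2 in a program is
sketched below. The class and field names, the choice of a tuple of points for
POSITION, and the printed form are all assumptions made for illustration; the
sketch is not the representation used by the programs described here.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple

Point = Tuple[int, int]


class Kind(Enum):
    EDGE = "EDGE"
    EXTENDED_EDGE = "EXTENDED-EDGE"
    SHADING_EDGE = "SHADING-EDGE"
    LINE = "LINE"
    BLOB = "BLOB"


@dataclass
class SketchElement:
    kind: Kind
    position: Tuple[Point, ...]      # two end points for a segment, one for a blob
    orientation: float               # degrees
    intensity: float                 # contrast or lightness
    length: Optional[float] = None
    width: Optional[float] = None
    terminations: List[Point] = field(default_factory=list)

    def size(self) -> str:
        """Diameter if the axes are equal or one is undefined, else length and width."""
        if self.length is None and self.width is None:
            return ""
        if self.length is None or self.width is None or self.length == self.width:
            diameter = self.length if self.length is not None else self.width
            return f"(SIZE (DIAMETER {diameter:g}))"
        return f"(SIZE (LENGTH {self.length:g}) (WIDTH {self.width:g}))"

    def describe(self) -> str:
        """Render the element as a parenthesized assertion."""
        points = " ".join(f"({x} {y})" for x, y in self.position)
        parts = [
            f"(POSITION {points})",
            f"(ORIENTATION {self.orientation:g})",
            f"(INTENSITY {self.intensity:g})",
        ]
        if self.size():
            parts.append(self.size())
        if self.terminations:
            ends = " ".join(f"({x} {y})" for x, y in self.terminations)
            parts.append(f"(TERMINATIONS {ends})")
        return f"({self.kind.value} " + " ".join(parts) + ")"


if __name__ == "__main__":
    # Hypothetical values, chosen only to exercise the class.
    edge = SketchElement(Kind.EDGE, ((10, 20), (40, 20)), orientation=0, intensity=34)
    blob = SketchElement(Kind.BLOB, ((73, 75),), 0, 12, length=3, width=3)
    print(edge.describe())
    print(blob.describe())
```

Later grouping processes would then operate on a list of such elements, for
example by selecting those whose parameters pass some first-order test, rather
than on the intensity array itself.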


What drawings tell us 

In order to make the second step of my argument, I must digress
awhile on the manifest variety of ways in which we can interpret simple pencil
drawings that lack semantic content. The point I wish to make is that from our
ability to interpret certain kinds of drawings, we can infer with some confidence
that certain kinds of symbolic process must exist in our visual systems. Let us
take an extreme example first. In figure 7a there is little doubt that some process
somewhere is creating a circular contour, and that the "places" in the image that
are giving rise to that contour are the inner ends of the radial lines. One cannot
argue that Fourier detection methods will produce it for one, because it really is
not there. This contour is not being detected, it is being constructed. Figure 7b
shows another example in which "ends of things" are being formed into a
perceptually vivid contour.

From these two rather strong examples, we see that abstractly
defined places in an image can be assembled into contours that have a definite
perceptual existence, despite the absence of apparent semantic content in the
image. If one approaches these phenomena from a computational point of view, it
is natural to think of this process as occurring in two steps. Firstly, certain things
in drawings can cause places to be defined in some abstract sense. Secondly,
"places", once defined, can be aggregated in various ways.

Having realized this, one immediately wants to know in what ways
places actually can be defined, and in how many different ways they can be
aggregated. A better feel for the problem can be gained by looking at the rest of
figure 7, and at figure 8. We are forced to conclude that "places" may carry
intrinsic orientation information, and that this orientation information may or may
not be used (figures 8d and 7c). Indeed these two situations can occur in the
same figure (7a).


FIGURE 6

6. The difference between the primal sketch and a feature-point is brought out by
the image 6a. A measurement taken with a large mask (6b) could generate a
feature-point, but it would not be used in the computation of the primal sketch.
This is because the sharp contrast changes force the use of measurements from
small masks (6c).

7. These drawings provide evidence for the action of several symbolic processes
during our perception of them. In particular, the circular "contour" in 7a, and the
linear one in 7b, are being constructed, not detected.

We see from these examples that the aggregation of places can
occur in two broad ways: clustering into groups that often have computable
boundaries, and the assembling of places into curves or lines, which I call
curvilinear aggregation. In the case where there is an orientation associated with
the place, aggregation can either use or ignore it. If the orientation is used, there
are two possible ways: the aggregation can either follow the intrinsic orientation,
or it can proceed in a fixed orientation relative to it (figure 8c). If the number of
places involved is very small (less than 5 say), the places may form a standard,
named configuration (see figure 9) which is evidently described relative to an axis
which is imposed on the figure, and whose default value is the vertical.

Interestingly, procedures for implementing each aggregation
technique are quite straightforward. They have a common flavor: a mixture of a
simple local process operating everywhere over the image, together with a
sensitivity to, and the ability to generate, one or two straightforward global
measures. To give you an idea of their simplicity, I shall outline one of them, which
we call theta-aggregation. Theta-aggregation is the process by which oriented
items are aggregated in a direction that differs from their intrinsic orientation. The
difficult part about it arises because measures of the "overlap" of two oriented
items depend upon the angle, theta, that the final aggregate makes with each local
unit (see figure 10). So theta determines the aggregation process, but also
depends upon it. For good data, it may be quite unnecessary to know theta: place
aggregation that ignores theta will suffice to compute the aggregate. In general,
however, one will need to take theta into account, as we shall shortly see.

Viewed from a very abstract level, this computation may be regarded as a process
of solving a large number of rather simple equations. In practice, a network with
feed-back will solve it, where the information being fed back is theta. We have
implemented an iterative version of this process, and some results are displayed
later on.
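
The circular dependence between theta and the aggregate can be made concrete
with a small sketch. It is not the procedure used in our system: the linking rule
below (a limit on separation along theta and on offset across it) is an assumed
stand-in for the overlap measure, the re-estimate of theta from the long axis of
the largest group is likewise an assumption, and all function and parameter names
are hypothetical. Only the starting value, the intrinsic orientation plus 90
degrees, is taken from the text.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def link_groups(centers: List[Point], theta_deg: float,
                max_gap: float, max_offset: float) -> List[List[int]]:
    """Group items whose separation lies mostly along the direction theta."""
    t = math.radians(theta_deg)
    ux, uy = math.cos(t), math.sin(t)           # unit vector along theta
    parent = list(range(len(centers)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, (xi, yi) in enumerate(centers):
        for j in range(i + 1, len(centers)):
            dx, dy = centers[j][0] - xi, centers[j][1] - yi
            along = abs(dx * ux + dy * uy)       # separation along theta
            across = abs(-dx * uy + dy * ux)     # offset perpendicular to theta
            if along <= max_gap and across <= max_offset:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(centers)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values(), key=len, reverse=True)


def principal_direction(points: List[Point]) -> float:
    """Orientation, in degrees within [0, 180), of the long axis of a point set."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points)
    syy = sum((p[1] - my) ** 2 for p in points)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points)
    return math.degrees(0.5 * math.atan2(2 * sxy, sxx - syy)) % 180.0


def theta_aggregate(centers, intrinsic_deg, max_gap, max_offset,
                    tol=0.5, max_iter=20):
    """Alternate between grouping with the current theta and re-estimating theta."""
    theta = (intrinsic_deg + 90.0) % 180.0       # default starting direction
    groups = []
    for _ in range(max_iter):
        groups = link_groups(centers, theta, max_gap, max_offset)
        if not groups or len(groups[0]) < 2:
            break                                # nothing to estimate theta from
        new_theta = principal_direction([centers[i] for i in groups[0]])
        diff = abs(new_theta - theta) % 180.0
        diff = min(diff, 180.0 - diff)
        theta = new_theta
        if diff < tol:
            break
    return theta, groups


def stripe(origin: Point, direction_deg: float, count: int) -> List[Point]:
    """Item centers spaced one unit apart along a given direction."""
    t = math.radians(direction_deg)
    return [(origin[0] + k * math.cos(t), origin[1] + k * math.sin(t))
            for k in range(count)]


# Two parallel stripes running at 110 degrees, 3 units apart; the items
# themselves are assumed to have an intrinsic orientation of 60 degrees.
n = math.radians(110.0 + 90.0)
centers = (stripe((0.0, 0.0), 110.0, 8)
           + stripe((3 * math.cos(n), 3 * math.sin(n)), 110.0, 8))
theta, groups = theta_aggregate(centers, intrinsic_deg=60.0,
                                max_gap=2.0, max_offset=0.8)
```

In this toy case the first pass, made with the default direction of 150 degrees,
already links each stripe into one group; theta is then re-estimated from that
group and the second pass confirms it. Noisier data would need the more stringent
initial parameters and relaxation that the text describes.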

In summary then, the argument of this section has been that our
ability to interpret certain simple drawings shows that we can bring certain highly
symbolic processes to bear on the analysis of drawings whose semantic content is
small. I summarize the processes that appear to be available below, even though
space has not permitted mention of several of them.

PLACES may be defined by:

(P1) The position of a blob, or of an edge or line that is not too long.

(P2) The end of an edge or line that is not too short, or of a blob with long major
axis and short minor axis.

(P3) A small aggregation of places.

The definition is slightly recursive. This is to be expected, since the assertions
produced by one aggregation process are presumably written into the same active,
geometrically organized storage processor as is the primal sketch. The precise
boundary between "too long" and "too short" can be left to individual taste,
because near it, both definitions will usually lead to the same aggregations. The
boundary needs to be in the region of 0.5 to 1 degrees of arc at foveal resolution.
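
The three rules can be sketched in a few lines. The code below is an illustration
under stated assumptions, not a description of our implementation: the Item
record, the `boundary` parameter (the "too long" / "too short" threshold, in
whatever units lengths are measured), the `elongation` ratio used to decide when a
blob counts as having a long major and short minor axis, and the use of a centroid
for P3 are all hypothetical choices.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class Item:
    kind: str            # "EDGE", "LINE" or "BLOB"
    position: Point      # center of the item
    orientation: float   # degrees; irrelevant for round blobs
    length: float        # major axis for a blob
    width: float         # minor axis for a blob


def places(item: Item, boundary: float, elongation: float = 3.0) -> List[Point]:
    """Places defined by one item under rules P1 and P2."""
    if item.kind == "BLOB":
        elongated = item.length >= elongation * item.width
    else:
        elongated = item.length > boundary
    if not elongated:
        return [item.position]                       # P1: the position itself
    t = math.radians(item.orientation)
    hx = 0.5 * item.length * math.cos(t)
    hy = 0.5 * item.length * math.sin(t)
    x, y = item.position
    return [(x - hx, y - hy), (x + hx, y + hy)]      # P2: the two ends


def centroid(group: List[Point]) -> Point:
    """P3: a small aggregation of places is itself a place (here, its centroid)."""
    n = len(group)
    return (sum(p[0] for p in group) / n, sum(p[1] for p in group) / n)
```

The hard threshold is a simplification. As noted above, near the boundary both
definitions usually lead to the same aggregations, so a real system could emit
both kinds of place there.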

AGGREGATION may proceed in the following ways:

(1) Clustering nearby places, using methods about as complex as B1 or B2 of
Jardine & Sibson (1971), but which are sensitive to global parameters of size and
average density. Clustering facilities that appear to have about this complexity can
operate on patterns of dots in most human visual systems (see e.g. Julesz (1971,
pp. 105ff), or recently O'Callaghan (1974a)).

(2) Curvilinear aggregation: aggregation that has a (local) orientation, and which
produces contours by joining nearby, aligned places. It is probable that only first
and second nearest neighbors need be considered by the local components of
these processes, but some global information is also generated and used [see
O'Callaghan (1974a and b) for access to recent literature on dot-grouping studies,
and Marr (1975)].

(3) Theta-aggregation: the grouping of local, similarly oriented items in a direction
that differs from the intrinsic orientation, but in a manner which uses it.

(4) If the number of places is small (< 5), the configuration formed by the places
may be described relative to some specified axis by means of a special
configuration datastructure (see Marr 1976).
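
To indicate how little machinery the first of these needs, here is a minimal sketch
of clustering that is sensitive to average density. It uses plain single-link
clustering as a simple stand-in and is not claimed to be the method of Jardine &
Sibson referred to above; the distance limit, set to a multiple of the mean
nearest-neighbor distance, and the names `cluster` and `factor` are assumptions
made for the example.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def mean_nearest_neighbor(points: List[Point]) -> float:
    """Average distance from each place to its nearest neighbor (needs >= 2 places)."""
    return sum(min(math.dist(p, q) for j, q in enumerate(points) if j != i)
               for i, p in enumerate(points)) / len(points)


def cluster(points: List[Point], factor: float = 2.0) -> List[List[int]]:
    """Single-link clusters: join places closer than factor * mean NN distance."""
    limit = factor * mean_nearest_neighbor(points)   # the global density measure
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = min(unvisited)
        unvisited.remove(seed)
        stack, members = [seed], [seed]
        while stack:
            i = stack.pop()
            near = [j for j in unvisited
                    if math.dist(points[i], points[j]) <= limit]
            for j in near:
                unvisited.remove(j)
            stack.extend(near)
            members.extend(near)
        clusters.append(sorted(members))
    return clusters


# Two clumps of dots, far apart relative to the spacing within each clump.
dots = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0),
        (10.0, 10.0), (11.0, 10.0), (10.0, 11.0)]
found = cluster(dots)
```

Because the limit is derived from the data, the same call separates the clumps
whether the pattern is drawn large or small, which is the sense in which the
process is sensitive to global parameters of size and density.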


Global Measures on the Primal Sketch 

Before the digression of the last section, we had reached the
point of defining the Primal Sketch, and of showing how to compute most of the
quantities in it. We also saw the primal sketch of a very straightforward image, of
a cylinder. The primal sketch is rarely as simple as that, however. Figures 12 and
13 contain examples of the primal sketches of more complex images and, as one
might expect, they are in general large and unwieldy collections of data.
Furthermore, it is difficult to see how the complexity of the primal sketch could be
an artifact of our particular choice of primitives: images really are complex in this
way.

The unwieldy nature of the primal sketch is therefore something
with which we have to live, and turn to our advantage if possible. The
fundamental problem of the next stage of the analysis is simply stated: how do
we select out from the primal sketch those regions that should be treated as unit
forms by subsequent descriptive processes, and is it possible to do this without
complex interactions between the primal sketch and higher-level knowledge? In
perceptual terms, the computational problem that we must now address
corresponds to distinguishing between figure and ground, and it is strongly related
to the problem of texture vision (Julesz 1971).

FIGURE 8

8. These drawings exhibit aggregation processes that take some account of the
orientation present at the aggregated places.

From an abstract point of view, the primal sketch is simply a
large body of data. There is therefore no difficulty in extracting from it certain
simple global measures and statistics. In particular, we shall assume that the
following measures are automatically available from any primal sketch:

MEASURES taken over moderately sized regions (0.5 to 10 degrees at foveal
resolution) of the primal sketch:

M0. The total amount of contour, and number of blobs, at different contrasts and
intensities.

M1. ORIENTATION: the total number of elements at each orientation, and the total
contour length at each orientation -- the orientations being divided into about 12
discriminable buckets. Detection of the existence of one, two, or three
predominant orientations, and the recognition of distributions that have substantial
amounts of contour in more than three orientations.

M2. SIZE: measurement of the mean and variance of the size parameters defined
in the primal sketch.

M3. INTENSITY: measurement of the mean and variance of the lightness of items
in the primal sketch.

M4. SPATIAL DENSITY: mean and variance of the nearest neighbor distances, and
possibly the mean second-nearest neighbor distance. There is no computational
problem in obtaining these measures.
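
That these measures pose no computational problem can be seen from a short
sketch. The code below is illustrative rather than a record of our programs: items
are reduced to (orientation, length) pairs, the 12 buckets of 15 degrees follow
M1, and the rule for calling an orientation "predominant" (a bucket holding at
least a quarter of the total contour length) is an assumption introduced for the
example, as are all the function names.

```python
import math
from statistics import mean, pvariance
from typing import List, Sequence, Tuple


def orientation_buckets(items: Sequence[Tuple[float, float]], n_buckets: int = 12):
    """M1: number of elements and total contour length in each orientation bucket.

    `items` holds (orientation in degrees, length) pairs.
    """
    width = 180.0 / n_buckets
    counts = [0] * n_buckets
    lengths = [0.0] * n_buckets
    for orientation, length in items:
        bucket = min(int((orientation % 180.0) // width), n_buckets - 1)
        counts[bucket] += 1
        lengths[bucket] += length
    return counts, lengths


def predominant(lengths: List[float], share: float = 0.25) -> List[int]:
    """Buckets that hold at least `share` of the total contour length."""
    total = sum(lengths)
    if total == 0:
        return []
    return [b for b, v in enumerate(lengths) if v >= share * total]


def mean_and_variance(values: Sequence[float]) -> Tuple[float, float]:
    """M2 and M3: mean and population variance of a size or lightness parameter."""
    return mean(values), pvariance(values)


def nearest_neighbor_distances(positions: Sequence[Tuple[float, float]]) -> List[float]:
    """M4: distance from each item to its nearest neighbor (needs >= 2 items)."""
    return [min(math.dist(p, q) for j, q in enumerate(positions) if j != i)
            for i, p in enumerate(positions)]


items = [(61.0, 13.0), (65.0, 12.0), (70.0, 14.0), (150.0, 3.0), (0.0, 2.0)]
counts, lengths = orientation_buckets(items)
main = predominant(lengths)            # indices of the predominant buckets
size_mean, size_var = mean_and_variance([length for _, length in items])
```

The M4 statistics follow by passing the output of `nearest_neighbor_distances`
to `mean_and_variance`.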


Texture Vision 

There are three parts to the problem of texture vision. How
does one discriminate between textures, and hence form regions from texture
differences? How does one describe the shapes and dispositions of the regions so
obtained? And finally, how does one interpret a texture, in the sense of
understanding the structure of the surface that gave rise to it? Only the first of
these will be dealt with here.

There are several current ideas on texture processing. Some
authors have used Fourier techniques, and in certain circumstances, the spatial
power spectrum can successfully separate different regions (Bajcsy 1972). Others
have constructed specialised operators which when applied to an image sometimes
discriminate between regions with different texture. Probably the earliest
example of this was the Roberts gradient (Roberts 1963). The most interesting
and comprehensive proposal is due to Julesz, Frisch, Gilbert and Shepp (1973)
[see also Julesz (1975)], who showed that visual textures that differ only in their
third or higher order statistical structure are rarely perceptually discriminable,
whereas visual textures that differ in their first or second order statistics can
almost always be distinguished. The important point about this finding lies in its
demonstration of the essential simplicity of texture processing. Although it gives
no insight into the exact nature of the processing, it does imply that all coefficients
of third and higher-order terms in its Volterra series expansion are zero.

FIGURE 9

9. Examples of "standard configurations" that we have found it useful to recognise.
The reader will probably perceive them relative to a vertical axis. The VEE
shown is used in figure 15e.

FIGURE 10

10. The measure of the overlap of two adjacent, parallel lines depends on an
external angle, theta. In 10a, theta is 90 degrees, which is the value at which
iteration begins.

We have now reached the core of this article. We saw in the
last section that certain computational facilities exist and are deployed during our
reading of certain kinds of drawings. The facilities were summarized as processes
P1-3 and A1-4 on page 14. It is, of course, possible that their existence is no
more than a happy accident, which fortuitously allows us to interpret the idle
scribblings of the artistically gifted. The central thesis of this article is that these
processes are available precisely because they are needed to help interpret the
primal sketch, and furthermore that those symbolic processes, together with first-
order discriminations based on the measures M0-4 defined on page 15, are
sufficient to account for the range of texture discriminations of which we are
capable, within the class of images to which this article is restricted. In other
words, texture vision is actually implemented not by second-order operations on
the image, but by first order discriminations, together with a small number of
grouping operations, acting on the primal sketch of the image. Julesz (1975, p. 43)
mentioned in an aside the possibility that texture vision may rest on "first-order
statistics of various simple feature extractors", but this idea requires the concepts
of the primal sketch and of the aggregation primitives before it can be brought to
fruition.

So that the reader may form an intuitive grasp of the central
thesis, let us re-examine two of the textures devised by Julesz, and follow this
with some examples of the texture analysis run on the images whose primal
sketches we saw earlier. Firstly, consider figure 11. Julesz notes that in 11a, the
two regions have distinct second-order statistics, but not in figure 11b. Hence,
according to his rule, the two regions are distinguishable in 11a, but not in 11b.

Now consider our new explanation of this. Orientation measures are the only
distinguishing feature of the primal sketch representation, because everything else
has carefully been held constant. In 11b, the two basic elements are related by a
180 degree rotation, and so the orientation statistics to which they give rise are
identical. Hence the two regions are indistinguishable. In 11a however, there is
more contour at 0 degrees than at 90 degrees in the central patch, but the
opposite is true in the surround. Hence the two regions are immediately
distinguished.

The second example appears as figure 11c. Some of the elements
in the pattern have been reflected about a vertical line through their centers.
Their second-order statistics are therefore different. This is an example in
which Julesz's generalization fails.

11. Examples of textures devised by Julesz. All three contain a square region
which differs from the background, but only in 11a is it immediately discernable.
The theory provides an explanation of all of them.

COARSE IMAGE DESCRIPTORS
(used in primary control of texture analysis)
Orientation buckets are 15 degrees wide

12. 12a gives a rendering of the primal sketch of the image of figure 1a. 12b
shows some measures made on it. Theta-aggregation has decoded the texture that
is present, and the aggregates are displayed as the mosaic 12c.

13. 13b shows a rendering of the primal sketch of 13a. 13c gives the associated
orientation-dependent statistics. The predominance of items at 60 degrees causes
theta-aggregation to be attempted at this orientation. The default setting of theta
produces the aggregate 13d. From this, theta is found, and the aggregation process
then extracts the stripes successfully (13e - i).

The statistics of the orientations of the contours are however unchanged in
this particular instance, because only vertical and horizontal orientations are
involved. Hence the present theory predicts that the two regions are in fact
indistinguishable.
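
The argument for both Julesz examples can be checked mechanically. In the sketch
below, which is an illustration and not part of our programs, a texture element is
reduced to a list of (orientation, length) pairs with positions ignored; the
element used, a horizontal stroke of length 2 with a vertical stroke of length 1,
is invented for the example, as are the function names.

```python
from collections import Counter
from typing import Dict, List, Tuple

Segment = Tuple[float, float]   # (orientation in degrees, length)


def orientation_stats(segments: List[Segment]) -> Dict[float, float]:
    """Total contour length at each orientation, taken modulo 180 degrees."""
    stats = Counter()
    for orientation, length in segments:
        stats[round(orientation % 180.0, 6)] += length
    return dict(stats)


def rotate(segments: List[Segment], angle: float) -> List[Segment]:
    """Rotate an element: every orientation is shifted by the same angle."""
    return [(o + angle, l) for o, l in segments]


def reflect_vertical(segments: List[Segment]) -> List[Segment]:
    """Mirror about a vertical line: orientation o becomes 180 - o."""
    return [(180.0 - o, l) for o, l in segments]


element = [(0.0, 2.0), (90.0, 1.0)]
base = orientation_stats(element)

same_after_180 = orientation_stats(rotate(element, 180.0)) == base      # as in 11b
same_after_90 = orientation_stats(rotate(element, 90.0)) == base        # as in 11a
same_after_mirror = orientation_stats(reflect_vertical(element)) == base  # as in 11c

# A reflection does change the statistics once oblique contours are present.
oblique = [(45.0, 1.0)]
oblique_same = orientation_stats(reflect_vertical(oblique)) == orientation_stats(oblique)
```

Only the 90 degree rotation alters the orientation statistics of this element,
which matches the perceptual result: the regions separate in 11a but not in 11b
or 11c. The oblique case shows that the prediction for 11c depends on the pattern
containing only vertical and horizontal contours.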

Now let us look at some real images. Figure 12a shows the
primal sketch of the chair whose image appeared as figure 1a, and figure 12b
gives some of its orientation statistics. The first thing to realize about this image
is that it is textured at all. The texture is so simple that one easily overlooks it.
Yet the texture exists in exactly the sense of this article, and the process that
succeeds in decoding it is theta-aggregation. Figure 12c shows the results of
running the theta-aggregation procedures on this image, and each element in the
mosaic contains just one aggregate.

We see from this example a glimmer of the power of texture
vision. Using one knowledge-free technique, we have separated the chair from its
background, and also separated the problem of divining the overall three-
dimensional shape of the chair from the analysis of its surface properties. Each of
the aggregates can be described simply by position, orientation, and extent, and
this produces a skeleton of the outline of the chair. By considering separately the
structure of just one aggregate, one could go on to compute a description of the
surface structure of the material out of which the chair is made.

The next example shows a more difficult case of theta-
aggregation. The image is taken from Brodatz (1972, plate D11), and the intensity
values are shown in 13a. Figure 13b shows an approximation to the primal sketch.
Contours of all intensities, lengths, and orientations are shown, and as one would
expect from an image of this complexity, 13b has a somewhat messy appearance.
Figure 13c gives statistical information about this image, from which it is evident
that items at an orientation of around 60 degrees are strongly predominant. The
average length of items at this orientation is 13. These coarse measures cause
the texture analyzer to attempt to group the edges at this orientation. Initially,
the direction in which grouping should take place is unknown, so a default of 150
degs (= 60 + 90) is assumed, and stringent grouping parameters are used. This
leads to the primary cluster shown in figure 13d. From this, the correct direction
is obtained (—SB degs), and the cluster process then groups the items into the
stripes shown in 13e, f, g, h, and i. This completes primary texture processing.
Once the primary stripes have been obtained, another stage of theta-aggregation
serves to relate the stripes to one another. Notice that in this image, some of the
stripe information has been picked up directly from intensity values (see figure
13b). This would not be true of a more herring-bone texture, and the analysis
does not depend upon it. Our present system is successful at processing herring-
bone textures of similar complexity in which the two types of stripe have the
same average reflectance.

FIGURE 14

14. Curvilinear aggregation operating on the primal sketch shown in figure 5e
produced the elements 14a, b, c. Once larger units have been obtained, the
governing parameters can be relaxed, and the elliptical form (14d) is obtained. At
this point, the system is unaware of its shape.

15. This image of a toy bear (15a) has the primal sketch illustrated in 15b. The
three principal forms extracted from 15b appear in 15c, d and e. The items in 15e
are classed as BLOBs, and the configuration that they form is recognised as a VEE
(figure 9) with modifier FLAT. The axis relative to which this description was
computed is the vertical (default value).

Next, we give an example of a simple kind of curvilinear
aggregation. The local elements of the primal sketch of the cylinder shown in
figure 5 are grouped using tight, conservative techniques into the units shown in
figure 14a, 14b, and 14c. These are then gathered, using slightly weaker
constraints, into the form shown in 14d. Notice that the contrast across the top-
left portion of the form has the opposite sign from the contrast elsewhere.
Curvilinear aggregation depends on local information about how well two adjacent
segments match, and on global information that includes for example whether the
complete form is closed. The global measures can affect the local choice of
segment in those infrequent cases where no candidate is to be preferred on
purely local grounds (see Marr 1976).
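
A minimal sketch of the local and global parts of such a process is given below.
It is an assumed simplification, not the procedure of Marr (1976): segments are
pairs of end points, the local match is judged only by the gap between ends and
the change of direction, candidates are chosen greedily, and the single global
measure computed is whether the form closes. The names `chain`, `max_gap` and
`max_turn` are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]
Segment = Tuple[Point, Point]


def direction(a: Point, b: Point) -> float:
    """Heading of the step from a to b, in degrees."""
    return math.degrees(math.atan2(b[1] - a[1], b[0] - a[0]))


def turn(d1: float, d2: float) -> float:
    """Absolute change of direction, in degrees, between two headings."""
    return abs((d2 - d1 + 180.0) % 360.0 - 180.0)


def chain(segments: List[Segment], max_gap: float, max_turn: float):
    """Greedily join segments end to end; returns (ordered segments, closed)."""
    remaining = list(segments[1:])
    path = [segments[0]]
    while remaining:
        a, b = path[-1]
        heading = direction(a, b)
        best = None
        for seg in remaining:
            for p, q in (seg, seg[::-1]):            # try the segment both ways round
                gap = math.dist(b, p)
                bend = turn(heading, direction(p, q))
                if gap <= max_gap and bend <= max_turn:
                    if best is None or (gap, bend) < best[0]:
                        best = ((gap, bend), seg, (p, q))
        if best is None:
            break                                    # no acceptable local match
        remaining.remove(best[1])
        path.append(best[2])
    # Global measure: does the end of the chain meet its beginning?
    closed = len(path) > 2 and math.dist(path[-1][1], path[0][0]) <= max_gap
    return path, closed


# The eight sides of a regular octagon, presented out of order and with some
# sides reversed, as unordered primal sketch segments might be.
v = [(math.cos(math.radians(45 * k)), math.sin(math.radians(45 * k)))
     for k in range(8)]
sides = [(v[k], v[(k + 1) % 8]) for k in range(8)]
scrambled = [sides[0], sides[3][::-1], sides[1], sides[6],
             sides[2][::-1], sides[7], sides[5][::-1], sides[4]]
path, closed = chain(scrambled, max_gap=0.1, max_turn=50.0)
```

Running the same call first with tight values of `max_gap` and `max_turn`, and
then again on the resulting units with looser ones, corresponds to the two stages
that produce 14a-c and then 14d. Using the closure test to break local ties, as
the text describes, is not attempted here.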

Finally, an example of several types of analysis appears in the
image of a toy bear (figure 15a). The primal sketch appears in 15b. The contours
of his face and muzzle appear in 15c and 15d, and the three blobs that come from
buttons that stand for his eyes and nose appear in 15e. The three blobs define
three places, which in turn provoke a specific configuration description relative to
the default axis, which is the vertical.

The examples given here do not prove the central thesis of this
article. This will need to be tested by experimenting with considerably more
images than the twenty or so with which we have dealt hitherto. But they give us
grounds for believing it to be a reasonable theory of the computational mechanisms
that underlie texture vision and the separation of figure from ground. A more
complete report is in preparation (Marr 1976).


The influence of higher-level knowledge and of purpose on visual
information processing

Perhaps the most novel aspect of these ideas is the notion that
the primal sketch exists as a distinct and circumscribed symbolic entity, computed
autonomously from the image, and operated on by a number of local geometrical
processes, semi-local measures, and first-order discriminations. In a computational
sense, the primal sketch is a very active structure. The information written into it
depends on the image, but lurking active in its fabric lie several highly abstract
geometrical and statistical processes. It is the direct analog for the class of images
studied here of the Cyclopean retina that Julesz (1971) wrote of for binocular
vision. More subjectively, it corresponds very closely to the "image" that one is
conscious of. This reflects the computational hypothesis that all subsequent
analysis reads the primal sketch, not the data from which it was computed. The
primal sketch therefore acts in a genuine sense as the interface at which visual
analysis becomes a purely symbolic affair.


If it turns out to be true that texture vision is successfully
implemented by approximately the set of processes that has been defined in this
article, it will mean that visual "form" can usually be extracted from the image by
using knowledge-free techniques. In other words, the extraction of a visual form
can usually precede its description. From this it follows that it is usually easy to
compute a coarse description of a form.

It is difficult to overstate the importance of this for determining
the structure of subsequent recognition processes. It means that one can see the
shape of the forest without first computing detailed descriptions of all the trees;
that one can compute the cluster of blobs that forms a distant village
independently of deciding that some of those blobs are actually buildings. In the
more mundane example of figure 15, one can compute that the overall shape of
the top form is roughly ovoidal without first having to segment out and describe
separately the bumps that are the bear's ears. Furthermore, it suggests that the
role of higher level knowledge in this process is not only very restricted, but is
also different in kind from its intervention in programs like Shirai's (1973). It does
not affect the line-finding stage (the computation of the primal sketch) at all. Its
most usual modus operandi is in choosing which processes are to be used to read
the primal sketch -- for example by specifying which texture predicate should be
used on the image to select the parts of current interest. It can also apply certain
limited kinds of flags to critical segments during their aggregation into forms. The
coupling between higher-level knowledge and the form-extraction processes is
however much weaker than the coupling between the different form-extraction
processes.

It is clearly desirable to have some control over which of the
possible forms in a figure should be delivered at a given moment from the primal
sketch. For example, in the image BEAR there are three possible major forms: the
outline of the head, the muzzle, and the three blobs that represent his eyes and
nose. It seems probable that only one of these should be made available at a time,
and this in turn raises interesting questions about the order in which it is done, the
way in which the three forms and their relative positions are described, and the
way in which those descriptions trigger a larger datastructure and are absorbed by
it. In living systems, which are powerful enough to operate in real time, the
control of the direction of gaze may be rather closely related to the order in which
these events take place.